Automatic Error Detection in Annotated Corpora

Author

  • Dipti Misra Sharma
Abstract

An annotated corpus is a linguistic resource that explicitly encodes syntactic and semantic information for each sentence. Annotated corpora play a crucial role in many natural language processing (NLP) applications, and error-free, consistent annotation is vital for them. Creating annotated corpora is expensive and time-consuming. Errors and anomalies creep in through human error and, sometimes, through differing interpretations of the annotation guidelines. Maintaining annotation quality is challenging because validating the corpora and correcting errors manually is itself expensive and time-consuming; in particular, validation requires an expert's time. Hence, experts need intelligent tools that automatically detect possible errors in annotated corpora, which they can then validate quickly. Treebank annotation involves encoding information at the POS, morphological, chunk and dependency levels. Annotation requires a domain-specific understanding of the language and of the dependency guidelines, and validation likewise requires experts in both. In this work, we address the problem of treebank validation and propose novel approaches to detect errors automatically. Specifically, we address errors at the dependency level, which is more vulnerable to errors because of the complex rules in the dependency annotation schema. Our solution applies ensemble methods to the outputs of parsers. We hypothesize that annotation and validation should proceed in parallel rather than waiting for the entire corpus to be created. Our tool presents annotators with error instances or inconsistent cases so that they can resolve ambiguities in their understanding by reflecting on this small number of instances.
This process helps annotators understand, early on, the errors committed during annotation. We also address the problem of skewed data sets, which is common in Indian languages, by utilizing word embeddings. We then attempt to build tools that correct dependency errors automatically. Our work mainly investigates error detection using dependency parsers, and it detects errors with F-scores of 71.18% and 42.19% on the available Hindi and Telugu treebanks, respectively. It also includes preliminary attempts to correct errors automatically, which raised the baseline precision of the Hindi treebank from 88.59% to 92.29%.
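
The abstract does not spell out how the parser outputs are combined, so the following is only a minimal sketch of one plausible ensemble check, not the authors' method. It assumes each token's annotation is a (head index, dependency label) pair, that several parsers have been run on the same sentence, and that a token is worth an expert's attention when enough parsers agree with each other but disagree with the manual annotation. The function name flag_suspect_tokens, the min_agreement parameter and the labels in the example are all hypothetical.

```python
from collections import Counter


def flag_suspect_tokens(gold, parser_outputs, min_agreement=2):
    """Return tokens whose manual arc is contradicted by a parser majority.

    gold: list of (head, label) pairs, one per token (the manual annotation).
    parser_outputs: list of such lists, one per parser, aligned with gold.
    min_agreement: hypothetical threshold on how many parsers must agree.
    """
    suspects = []
    for i, gold_arc in enumerate(gold):
        # Count how many parsers proposed each (head, label) arc for token i.
        votes = Counter(output[i] for output in parser_outputs)
        top_arc, count = votes.most_common(1)[0]
        # Flag only when the agreed-upon arc differs from the manual one.
        if count >= min_agreement and top_arc != gold_arc:
            suspects.append((i, gold_arc, top_arc))
    return suspects


# Illustrative sentence with three tokens and three parsers (made-up labels).
gold = [(2, "subj"), (0, "root"), (2, "obj")]
parsers = [
    [(2, "subj"), (0, "root"), (2, "obj")],
    [(2, "subj"), (0, "root"), (2, "iobj")],
    [(2, "obj"), (0, "root"), (2, "iobj")],
]
suspects = flag_suspect_tokens(gold, parsers)
```

Flagged tokens are candidates for review rather than confirmed errors: the parsers may share a systematic mistake, which is why the abstract keeps an expert in the loop to validate them.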


Similar resources

POS error detection in automatically annotated corpora

Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, cannot be used for automatically annotated corpora where errors are systematic and cannot easily be identified by looking at the variance in the data...


Terra: a Collection of Translation Error-Annotated Corpora

Recently the first methods of automatic diagnostics of machine translation have emerged; since this area of research is relatively young, the efforts are not coordinated. We present a collection of translation error-annotated corpora, consisting of automatically produced translations and their detailed manual translation error analysis. Using the collected corpora we evaluate the available stat...


An Annotated Corpus Management Tool: ChaKi

Large-scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks, since a number of practical tools such as part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management t...


Feature-Rich Part-Of-Speech Tagging Using Deep Syntactic and Semantic Analysis

This paper describes the implementation, improvement and evaluation of the machine translation (MT) system proposed by Jackov (2014) when used as a feature-rich part-of-speech (POS) tagger for Bulgarian. The system does not rely on POS tagging for morphological disambiguation. Instead, all ambiguities are considered in parsing hypotheses that are scored and the best one is used for tagging. The ...


Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora

We present a method for automatically learning inflectional classes and associated lemmas from morphologically annotated corpora. The method consists of a core language-independent algorithm, which can be optimized for specific languages. The method is demonstrated on Egyptian Arabic and German, two morphologically rich languages. Our best method for Egyptian Arabic provides an error reduction o...




Publication date: 2017